Bulk-copy operations can occur in two modes: logged and nonlogged (also known as slow and fast bcp,
respectively). The ideal situation is to operate in nonlogged mode
because this arrangement dramatically decreases the load time and
consumption of other system resources, such as memory, processor use,
and disk access. By default, however, the load runs in logged mode,
which causes the log to grow rapidly for large volumes of data.
To achieve a
nonlogged operation, the target table must not be replicated (the
replication log reader needs the log records to relay the changes
made). The database holding the target table must also have its SELECT INTO/BULK COPY option set, and finally, the TABLOCK hint must be specified.
Note
Remember that setting the SELECT INTO/BULK COPY
option disables the capability to back up the transaction log until a
full database backup has been performed. Transaction log dumps are
disabled because if the database had to be restored, the transaction
log would not contain a record of the new data.
Although you can still
perform fast loads against tables that have indexes, it is advisable to
drop and re-create the indexes after the data transfer operation is
complete. Keep in mind that the total load time then includes both the
data load and the index creation. If there is existing data in the
table, the operation is logged; you achieve a nonlogged operation only
if the table is initially empty.
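The conditions listed so far can be collected into a simple pre-load checklist. The following Python sketch is illustrative only: the LoadContext class, its field names, and the function name are hypothetical, and the values would have to be gathered by inspecting the server yourself. It encodes only the rules stated in this section.

```python
from dataclasses import dataclass


@dataclass
class LoadContext:
    """Facts about a planned bcp load (all field names are hypothetical)."""
    table_is_replicated: bool
    select_into_bulkcopy_on: bool
    tablock_hint_given: bool
    table_is_empty: bool


def blockers_for_nonlogged_load(ctx: LoadContext) -> list[str]:
    """Return the reasons, per the rules in this section, that force a logged load."""
    blockers = []
    if ctx.table_is_replicated:
        blockers.append("target table is replicated")
    if not ctx.select_into_bulkcopy_on:
        blockers.append("SELECT INTO/BULK COPY option is not set")
    if not ctx.tablock_hint_given:
        blockers.append("TABLOCK hint is not specified")
    if not ctx.table_is_empty:
        blockers.append("target table already contains data")
    return blockers
```

An empty list means that none of the conditions described here stands in the way of a nonlogged load; any entry names a condition to fix before starting bcp.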
Generally,
you get at least a 50% drop in transfer speed if the table has an
index. The more indexes, the greater the performance degradation. This
is due to the logging factor: more log records are being generated, and
index pages are being loaded into the cache and modified. This can also
cause the log to grow, possibly filling it (depending on the log file
settings).
Note
Despite the name, even a
nonlogged operation logs some things. In the case of indexes, index
page changes and allocations are logged, but the main area of logging
is of extent allocations every time the table is extended for
additional storage space for the new rows.
Batches
By default, bcp puts all the rows that are inserted into the target table into a single transaction. bcp calls this a batch.
This arrangement reduces the amount of work the log must deal with;
however, it locks down the transaction log by keeping a large part of
it active, which can make truncating or backing up the transaction log
impossible or unproductive. By using the bcp batch (-b)
switch, you can control the number of rows in each batch (or,
effectively, each transaction). This switch controls the frequency of
commits; although it can increase the activity in the log, it enables
you to trim the size of the transaction log. You should tune the batch
size in relation to the size of the data rows, transaction log size,
and total number of rows to be loaded. The value you use for one load
might not necessarily be the right value for all other loads.
Note that if a subsequent batch fails, the prior batches are
committed, and those rows become part of the table. However, any rows
copied up to the point of failure in the failing batch are rolled back.
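To make the commit behavior concrete, the following Python sketch works out how many transactions a given batch size produces and how many rows remain in the table if a batch fails partway through. The function names are hypothetical, and the sketch assumes that a failure aborts the whole batch, as described above.

```python
import math


def batch_count(total_rows: int, batch_size: int) -> int:
    """Number of transactions committed when -b is set to batch_size."""
    return math.ceil(total_rows / batch_size)


def rows_kept_after_failure(failing_row: int, batch_size: int) -> int:
    """Rows that stay in the table if the load fails at failing_row (1-based).

    Every batch that completed before the failing one is already committed;
    the partially copied batch is rolled back.
    """
    completed_batches = (failing_row - 1) // batch_size
    return completed_batches * batch_size
```

For example, with a batch size of 5,000, a failure at row 12,345 leaves the first two batches (10,000 rows) in the table, so a restart can resume from row 10,001 rather than reloading everything.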
Parallel Loading
A great enhancement of bcp
is that you can now use it to do parallel loads of tables. If you want
to take advantage of this feature, the following must be true:
Only applications using the ODBC or SQL OLE DB–based APIs can perform parallel data loads into a single table.
The procedure is
straightforward. After you ascertain that the target table has no
indexes (which could involve dropping primary or unique constraints)
and is not being replicated, you must set the database option SELECT INTO/BULK COPY to true.
The requirement to drop all indexes has to do with the locking that
must occur to load the data. Although the table itself can have a
shared lock, the index pages are an area of contention that prevents
parallel access.
Now all that is required is to set up the parallel bcp loads to load the data into the table. You can use the -F and -L switches to specify the range of the data you want each parallel bcp
to load into the table if you are using the same data file. Using these
switches removes the need to manually break up the file. Here is an
example of the command switches involved for a parallel load with bcp for the Sales.SalesOrderHeader table:
bcp AdventureWorks2008.Sales.SalesOrderHeader IN SalesOrders10000.dat -T -S servername -c -F 1 -L 10000 -h "TABLOCK"
bcp AdventureWorks2008.Sales.SalesOrderHeader IN SalesOrders20000.dat -T -S servername -c -F 10001 -L 20000 -h "TABLOCK"
The TABLOCK hint (-h
switch) provides improved performance by removing contention from other
users while the load takes place. If you do not use the hint, the load
takes place using row-level locks, and this is considerably slower.
SQL Server 2008 allows parallel loads without affecting performance by making each bcp connection create extents in nonoverlapping ranges. The ranges are then linked into the table’s page chain.
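When several bcp sessions read from the same data file, the -F and -L values for each session can be computed rather than typed by hand. The following Python sketch, with hypothetical function names, splits a known row count into contiguous ranges and builds one bcp argument list per range. It only constructs the commands and does not run them.

```python
def split_row_ranges(total_rows: int, workers: int) -> list[tuple[int, int]]:
    """Split rows 1..total_rows into contiguous (first, last) ranges."""
    base, extra = divmod(total_rows, workers)
    ranges, first = [], 1
    for i in range(workers):
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more workers than rows: skip the empty range
        ranges.append((first, first + size - 1))
        first += size
    return ranges


def bcp_parallel_commands(table, data_file, server, total_rows, workers):
    """Build one bcp argument list per range; nothing is executed here."""
    return [
        ["bcp", table, "in", data_file, "-T", "-S", server, "-c",
         "-F", str(first), "-L", str(last), "-h", "TABLOCK"]
        for first, last in split_row_ranges(total_rows, workers)
    ]
```

Each argument list could then be started as its own process (for example, with subprocess.Popen) so that the loads run concurrently. The row count must be known in advance; the sketch does not read the data file.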
After the table is
loaded, it is also possible to create multiple nonclustered indexes in
parallel. If there is a clustered index, you create that one first and
then create the nonclustered indexes in parallel.
Supplying Hints to bcp
The SQL Server 2008 version of bcp
enables you to further control the speed of data loading, to invoke
constraints, and to have insert triggers fired during loads. To take
advantage of these capabilities, you use hint switches to specify one
or more hints at a time. Following is the syntax:
-h "hint [,...n]"
This option cannot be
used when bulk-copying data into versions of SQL Server before version
7.0 because, starting with SQL Server 7.0, bcp
works in conjunction with the query processor. The query processor
optimizes data loads and unloads for OLE DB rowsets that the
latest versions of bcp and BULK INSERT can generate.
The following sections describe the various hints you can specify with the -h switch.
The ROWS_PER_BATCH Hint
The ROWS_PER_BATCH
hint is used to tell SQL Server the total number of rows in the data
file. This hint helps SQL Server optimize the entire load operation.
This hint and the -b switch heavily influence the logging operations that occur with data inserts. If you specify both this hint and the -b switch, they must have the same values, or you get an error message.
When you use the ROWS_PER_BATCH
hint, you copy the entire result set as a single transaction. SQL
Server automatically optimizes the load operation, using the batch size
you specify. The value you specify does not have to be accurate, but you should be aware of the practical limit, based on the database’s transaction log.
Tip
Do not be confused by the name of the ROWS_PER_BATCH hint. You are specifying the total number of rows in the data file and not the batch size (as is the case with the -b switch).
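Because the ROWS_PER_BATCH value is a row count for the whole data file, one way to obtain it for a character-mode file is to count the row terminators. The following Python sketch assumes that each row ends with a newline (with or without a preceding carriage return); the function names are hypothetical.

```python
import io


def estimate_rows(stream: io.BufferedIOBase, chunk_size: int = 1 << 20) -> int:
    """Count newline row terminators in a binary stream, reading in chunks."""
    count = 0
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        count += chunk.count(b"\n")
    return count


def rows_per_batch_hint(stream: io.BufferedIOBase) -> str:
    """Format the estimate as a hint string for the -h switch."""
    return f"ROWS_PER_BATCH = {estimate_rows(stream)}"
```

You would open the data file in binary mode and pass the file object to rows_per_batch_hint. Because the value does not have to be exact, a final row without a terminator, which this count misses, is usually not a concern.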
The CHECK_CONSTRAINTS Hint
The CHECK_CONSTRAINTS hint controls whether check constraints are executed as part of the bcp operation. With bcp,
the default is that check constraints are not executed. This hint
option allows you to turn the feature on (to have check constraints
executed for each insert). If you do not use this option, you should
either be very sure of your data or, after the data has been loaded,
rerun the same logic as the check constraints you bypassed.
The FIRE_TRIGGERS Hint
The FIRE_TRIGGERS hint controls whether insert triggers on the target table are executed as part of the bcp operation. With bcp,
the default is that no triggers are executed. This hint option allows
you to turn the feature on (to have insert triggers executed for each
insert). As you can imagine, using this option slows down
the bcp load operation. However, the business reasons to have the insert trigger fired might outweigh the slower loading.
The ORDER Hint
If the data you want to load is already in the same sequence as the clustered index on the receiving table, you can use the ORDER hint. The syntax for this hint is as follows:
ORDER( {column [ASC | DESC] [,...n]})
There must be a clustered index on the same columns, in the same key sequence as specified in the ORDER
hint. Using a sorted data file (in the same order as the clustering
index) helps SQL Server place the data into the table with minimal
overhead.
The KILOBYTES_PER_BATCH Hint
The KILOBYTES_PER_BATCH
hint gives the size, in kilobytes, of the data in each batch. This is
an estimate that SQL Server uses internally to optimize the data load
and logging areas of the bcp operation.
The TABLOCK Hint
The TABLOCK hint is used to place a table-level lock for the bcp load duration. This hint gives you increased performance at a loss of concurrency, as described in the section “Parallel Loading,” earlier in this chapter.
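Several hints can be combined in a single -h argument, separated by commas. The following Python sketch assembles that argument from the hints described in this section. The function and parameter names are hypothetical; the hint keywords are the ones bcp accepts, and bcp spells the trigger hint FIRE_TRIGGERS. The sketch does not check whether a particular combination makes sense for your load.

```python
def build_hint_argument(tablock=False, check_constraints=False,
                        fire_triggers=False, rows_per_batch=None,
                        kilobytes_per_batch=None, order=None):
    """Return ["-h", "<hints>"] for a bcp command line, or [] if no hints.

    order is a list of (column, direction) pairs,
    for example [("SalesOrderID", "ASC")].
    """
    hints = []
    if order:
        cols = ", ".join(f"{col} {direction}" for col, direction in order)
        hints.append(f"ORDER({cols})")
    if rows_per_batch is not None:
        hints.append(f"ROWS_PER_BATCH = {rows_per_batch}")
    if kilobytes_per_batch is not None:
        hints.append(f"KILOBYTES_PER_BATCH = {kilobytes_per_batch}")
    if tablock:
        hints.append("TABLOCK")
    if check_constraints:
        hints.append("CHECK_CONSTRAINTS")
    if fire_triggers:
        hints.append("FIRE_TRIGGERS")
    return ["-h", ", ".join(hints)] if hints else []
```

The returned list can be appended to the rest of a bcp argument list. Passing the hints as one list element avoids shell quoting problems with the embedded spaces and parentheses.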